ABSTRACT

Since font is one of the core design concepts, automatic font
identification and similar-font suggestion from an image or
photo have long been on the wish list of many designers. We
study the Visual Font Recognition (VFR) problem, and
advance the state of the art remarkably by developing the
DeepFont system. First, we build the first available
large-scale VFR dataset, named AdobeVFR, consisting
of both labeled synthetic data and partially labeled
real-world data. Next, to combat the domain mismatch between
available training and testing data, we introduce a
Convolutional Neural Network (CNN) decomposition approach,
using a domain adaptation technique based on a Stacked
Convolutional Auto-Encoder (SCAE) that exploits a large
corpus of unlabeled real-world text images combined with
synthetic data preprocessed in a specific way. Moreover, we
study a novel learning-based model compression approach,
in order to reduce the DeepFont model size without
sacrificing its performance. The DeepFont system achieves a
top-5 accuracy higher than 80% on our collected dataset,
and also produces a good font similarity measure for font
selection and suggestion. We also achieve around 6 times
compression of the model without any visible loss of
recognition accuracy.

Categories and Subject Descriptors 

I.4.7 [Image Processing and Computer Vision]: Feature
measurement; I.4.10 [Image Processing and Computer
Vision]: Image Representation; I.5 [Pattern Recognition]:
Classifier design and evaluation

General Terms 

Algorithms, Experimentation 

Keywords 

Visual Font Recognition; Deep Learning; Domain Adaptation;
Model Compression

1. INTRODUCTION

Typography is fundamental to graphic design. Graphic
designers want to identify the fonts they encounter
in daily life for later use. While they might take a photo of
text in a particularly interesting font and seek out an
expert to identify it, the manual identification process
is extremely tedious and error-prone. Several websites allow
users to search and recognize fonts by font similarity,
including Identifont, MyFonts, WhatTheFont, and Fontspring. All
of them rely on tedious human interaction and high-quality
manual pre-processing of images, and their accuracy is still
unsatisfactory. On the other hand, the majority of font
selection interfaces in existing software are simple linear lists,
while exhaustively exploring the entire space of fonts using
an alphabetical listing is unrealistic for most users.

Effective automatic font identification from an image or
photo could greatly ease the above difficulties, and
facilitate font organization and selection during the design
process. Such a Visual Font Recognition (VFR) problem is
inherently difficult, as pointed out in prior work, due to the huge
space of possible fonts (online repositories provide hundreds
of thousands), the dynamic and open-ended properties of
font classes, and the very subtle and character-dependent
differences among fonts (letter endings, weights, slopes, etc.).
More importantly, while the popular machine learning
techniques are data-driven, collecting real-world data for a large
collection of font classes turns out to be extremely difficult.
Most attainable real-world text images do not have font label
information, while the error-prone font labeling task requires
font expertise that is out of reach of most people.

The few previous approaches are mostly from the
document analysis standpoint; they focus on only a small
number of font classes, and are highly sensitive to noise,
blur, perspective distortions, and complex backgrounds.
An earlier work proposed a large-scale, learning-based
solution without dependence on character segmentation or OCR.
The core algorithm is built on local feature embedding, local
feature metric learning and max-margin template selection.
However, the reported results suggest that its robustness to
real-world variations is unsatisfactory, and a higher recognition
accuracy is still demanded.

Figure 1: (a) (b) Successful VFR examples with the DeepFont system. The top row shows query images
from the VFR_real_test dataset. Below each query, the results (left column: font classes; right column: images
rendered with the corresponding font classes) are listed in high-to-low order in terms of likelihood. The
correct results are marked by the red boxes. (c) More correctly recognized real-world images with DeepFont.

Inspired by the great success achieved by deep learning
models in many other computer vision tasks, we develop
a VFR system for the Roman alphabet, based on
Convolutional Neural Networks (CNN), named DeepFont.
Without any dependence on character segmentation or
content text, the DeepFont system obtains an impressive
performance on our collected large real-world dataset, covering
an extensive variety of font categories. Our technical
contributions are listed below:

• AdobeVFR Dataset. A large set of labeled real-world
images as well as a large corpus of unlabeled real-world
data are collected for both training and testing, which
is the first of its kind and will be publicly released soon.
We also leverage a large training corpus of labeled
synthetic data augmented in a specific way.

• Domain Adapted CNN. It is very easy to generate
lots of rendered font examples but very hard to obtain
labeled real-world images for supervised training. This
real-to-synthetic domain gap caused poor generalization
to new real data in previous VFR methods.
We address this domain mismatch problem by leveraging
synthetic data to obtain effective classification
features, while introducing a domain adaptation
technique based on a Stacked Convolutional Auto-Encoder
(SCAE) with the help of unlabeled real-world data.

• Learning-based Model Compression. We introduce
a novel learning-based approach to obtain a losslessly
compressible model, for a high compression ratio
without sacrificing performance. An exact low-rank
constraint is enforced on the targeted weight matrix.

Fig. 1 shows successful VFR examples using DeepFont. In
(a)(b), given the real-world query images, top-5 font
recognition results are listed, within which the ground truth font
classes are marked out. More real-world examples are
displayed in (c). Although accompanied by high levels of
background clutter, size and ratio variations, as well as
perspective distortions, they are all correctly recognized by the
DeepFont system.

^Note that the texts are input manually for rendering
purposes only. The font recognition process does not need any
content information.


Table 1: Comparison of All VFR Datasets

Dataset name     Source  Label?  Purpose  Size       Class
VFRWild325 [4]   Real    Y       Test     325        93
VFR_real_test    Real    Y       Test     4,384      617
VFR_real_u       Real    N       Train    197,396    /
VFR_syn_train    Syn     Y       Train    2,383,000  2,383
VFR_syn_val      Syn     Y       Test     238,300    2,383

2. DATASET 

2.1 Domain Mismatch between Synthetic and 
Real-World Data 

To apply machine learning to the VFR problem, we require
realistic text images with ground truth font labels.
However, such data is scarce and expensive to obtain.
Moreover, the training data requirement is vast, since there are
hundreds of thousands of fonts in use for Roman characters
alone. One way to overcome the training data challenge is to
synthesize the training set by rendering text fragments for
all the necessary fonts. However, to attain effective
recognition models with this strategy, we must face the domain
mismatch between synthetic and real-world text images.
For instance, it is common for designers to edit the spacing,
aspect ratio or alignment of text arbitrarily, to make the
text fit other design components. The result is that
characters in real-world images are spaced, stretched and distorted
in numerous ways. For example, Fig. 2(a) and (b) depict
typical examples of character spacing and aspect ratio
differences between (standard rendered) synthetic and real-world
images. Other perturbations, such as background clutter,
perspective distortion, noise, and blur, are also ubiquitous.

2.2 The AdobeVFR Dataset

Collecting and labeling real-world examples is notoriously
hard, and thus a labeled real-world dataset has long been
absent. A small dataset, VFRWild325, was collected in previous
work, consisting of 325 real-world text images and 93 classes.
However, its small size puts its effectiveness in jeopardy.

Chen et al. selected 2,420 font classes to work on.
We remove some script classes, ending up with a total of
2,383 font classes. We collected 201,780 text images from
various typography forums, where people post these images
seeking help from experts to identify the fonts. Most of them
come with hand-annotated font labels, which may be
inaccurate. Unfortunately, only a very small portion of them fall
into our list of 2,383 fonts. All images are first converted
to gray scale. Those images with our target class labels
are then selected, and independent experts inspect whether
their labels are correct. Images with verified labels are then
manually cropped with tight bounding boxes and normalized
proportionally in size to an identical height
of 105 pixels. Finally, we obtain 4,384 real-world test
images with reliable labels, covering 617 classes (out of 2,383).
Compared to the synthetic data, these images typically have
much larger appearance variations caused by scaling,
background clutter, lighting, noise, perspective distortions, and
compression artifacts. Removing the 4,384 labeled images
from the full set, we are left with 197,396 unlabeled
real-world images, which we denote as VFR_real_u.

To create a sufficiently large set of synthetic training data,
we follow the procedure of previous work to render long English
words sampled from a large corpus, and generate tightly cropped,
gray-scale, and size-normalized text images. For each class,
we assign 1,000 images for training, and 100 for validation,
which are denoted as VFR_syn_train and VFR_syn_val,
respectively. The entire AdobeVFR dataset, consisting of
VFR_real_test, VFR_real_u, VFR_syn_train and VFR_syn_val,
is made publicly available.

The AdobeVFR dataset is the first large-scale benchmark
set consisting of both synthetic and real-world text images
for the task of font recognition. To the best of our knowledge,
VFR_real_test is so far the largest available set of real-world
text images with reliable font label information (12.5 times
larger than VFRWild325). The AdobeVFR dataset is extremely
fine-grained, with highly subtle categorical variations, lending
itself to use as a new challenging dataset for object recognition.
Moreover, the substantial mismatch between synthetic and
real-world data makes the AdobeVFR dataset an ideal
subject for general domain adaptation and transfer learning
research. It also promotes the new problem area of
understanding design styles with deep learning.

2.3 Synthetic Data Augmentation: A First Step 
to Reduce the Mismatch 

^http://www.atlaswang.com/deepfont.html


Figure 2: (a) The different character spacings between
a pair of synthetic and real-world images. (b)
The different aspect ratios between a pair of synthetic
and real-world images.

Before feeding synthetic data into model training, it is
popular to artificially augment the training data using
label-preserving transformations to reduce overfitting. In earlier
work, the authors applied image translations and horizontal
reflections to the training images, and altered the intensities of
their RGB channels. Previous VFR work added moderate
distortions and corruptions to the synthetic text images:

• 1. Noise: a small Gaussian noise with zero mean and
standard deviation 3 is added to the input.

• 2. Blur: a random Gaussian blur with standard
deviation from 2.5 to 3.5 is added to the input.

• 3. Perspective Rotation: a randomly-parameterized
affine transformation is added to the input.

• 4. Shading: the input background is filled with a
gradient in illumination.

The above augmentations cover standard perturbations for
general images, and we adopt them. However, as a very
particular type of image, text images have various
real-world appearances caused by specific handling. Based on
the observations in Fig. 2, we add two
font-specific augmentation steps to our training data:

• 5. Variable Character Spacing: when rendering 
each synthetic image, we set the character spacing (by 
pixel) to be a Gaussian random variable of mean 10 
and standard deviation 40, bounded by [0, 50]. 

• 6. Variable Aspect Ratio: before cropping each
image into an input patch, the image, with height fixed,
is squeezed in width by a random ratio, drawn from a 
uniform distribution between | and |. 

Note that these steps are not useful for the previous VFR method,
because it exploits very localized features. However, as we
show in our experiments, these steps lead to significant
performance improvements in our DeepFont system. Overall,
our data augmentation includes steps 1-6.
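
As a minimal sketch of how steps 1, 5 and 6 could be applied, the NumPy snippet below samples a character spacing, adds pixel noise, and squeezes an image in width. Several points are assumptions rather than details given in the text: "bounded by [0, 50]" is read as clipping, the squeeze is implemented with nearest-neighbour resampling, and the function names and the aspect-ratio bounds passed by the caller are hypothetical.

```python
import numpy as np

def sample_char_spacing(rng, mean=10.0, std=40.0, low=0.0, high=50.0):
    """Step 5: draw a spacing (pixels) from N(mean, std), clipped to [low, high].
    Clipping is an assumption; rejection sampling would also satisfy the bound."""
    return float(np.clip(rng.normal(mean, std), low, high))

def add_gaussian_noise(img, rng, std=3.0):
    """Step 1: add zero-mean Gaussian noise to a float image in [0, 255]."""
    return np.clip(img + rng.normal(0.0, std, img.shape), 0.0, 255.0)

def squeeze_width(img, ratio):
    """Step 6: rescale the width by `ratio`, height fixed (nearest neighbour)."""
    h, w = img.shape
    new_w = max(1, int(round(w * ratio)))
    # Map every output column back to its source column.
    cols = np.minimum((np.arange(new_w) / ratio).astype(int), w - 1)
    return img[:, cols]

rng = np.random.default_rng(0)
image = np.full((105, 200), 128.0)           # toy gray-scale text image
spacing = sample_char_spacing(rng)           # would be passed to the renderer
noisy = add_gaussian_noise(image, rng)
ratio = rng.uniform(0.8, 1.2)                # hypothetical bounds, for illustration only
squeezed = squeeze_width(noisy, ratio)
```

In a full pipeline the spacing would be consumed by the text renderer before the image exists, while noise and squeezing operate on the rendered image.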

To give a visual impression, we take the real-world
image in Fig. 2(a), and synthesize a series of images in Fig. 3,
all with the same text but with different data augmentation
settings. Specifically, (a) is synthesized with no data
augmentation; (b) is (a) with standard augmentation 1-4 added; (c)
is synthesized with spacing and aspect ratio customized to
be identical to those of Fig. 2(a); (d) adds standard
augmentation 1-4 to (c). We feed images (a)-(d) through the
trained DeepFont model. For each image, we compare its
layer-wise activations with those of the real image Fig. 2(a)
fed through the same model, by calculating the
normalized MSEs. Fig. 3(e) shows that those augmentations,
especially the spacing and aspect ratio changes, reduce the
gap between the feature hierarchies of real-world and
synthetic data to a large extent. A few synthetic patches after
full data augmentation 1-6 are displayed in Fig. 4. They
are visibly much more similar in appearance to
real-world data.

Figure 3: The effects of data augmentation steps.
(a)-(d): synthetic images of the same text but with
different data augmentation settings. (e) Relative CNN
layer-wise responses: compares the relative differences of
(a)-(d) with the real-world image Fig. 2(a), measured by
layer-wise network activations through the same DeepFont model.

Figure 4: Examples of synthetic training 105 x 105
patches after pre-processing steps 1-6.
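
The layer-wise comparison described above can be illustrated with a short NumPy sketch. It assumes that "normalized MSE" means the squared difference divided by the energy of the real image's activations, and that activations are available as a mapping from layer name to array; the function names and the toy layer names are hypothetical.

```python
import numpy as np

def relative_mse(reference, other):
    """Squared error between two activation arrays, divided by the
    energy (sum of squares) of `reference`."""
    reference = np.asarray(reference, dtype=float)
    other = np.asarray(other, dtype=float)
    return float(np.sum((other - reference) ** 2) / np.sum(reference ** 2))

def layerwise_relative_mse(real_acts, synth_acts):
    """Apply relative_mse per layer; both arguments map layer name -> activations."""
    return {name: relative_mse(real_acts[name], synth_acts[name])
            for name in real_acts}

# Toy activations standing in for the outputs of two network layers.
real = {"layer1": [1.0, 2.0, 2.0], "layer2": [3.0, 4.0]}
synth = {"layer1": [1.0, 2.0, 2.0], "layer2": [3.0, 1.0]}
gaps = layerwise_relative_mse(real, synth)
```

A smaller value for a layer means the synthetic image produces activations closer to those of the real image at that depth.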


3. DOMAIN ADAPTED CNN 

3.1 Domain Adaptation by CNN Decomposition
and SCAE

Although data augmentation helps reduce
the domain mismatch, enumerating all possible real-world
degradations is impossible, and may further introduce
degradation bias in training. In this section, we propose a learning
framework to leverage both synthetic and real-world data,
using multi-layer CNN decomposition and SCAE-based
domain adaptation. Our approach extends an existing domain
adaptation method to extract low-level features that
represent both the synthetic and real-world data. We employ a
Convolutional Neural Network (CNN) architecture, which is
further decomposed into two sub-networks: a “shared”
low-level sub-network, which is learned from the composite set of
synthetic and real-world data, and a high-level sub-network
that learns a deep classifier from the low-level features.

The basic CNN architecture is similar to the popular
ImageNet structure, as in Fig. 5. The numbers along
the network pipeline specify the dimensions of the outputs of
the corresponding layers. The input is a 105 x 105 patch
sampled from a “normalized” image. Since a square window may
not capture sufficient discriminative local structures, and
is unlikely to catch high-level combinational features when
two or more graphemes or letters are joined as a single glyph
(e.g., ligatures), we introduce a squeezing operation that
scales the width of the height-normalized image to a
constant ratio relative to the height (2.5 in all our
experiments). Note that the squeezing operation is equivalent to
producing “long” rectangular input patches.

When the CNN model is trained fully on a synthetic dataset,
it suffers a significant performance drop when tested on
real-world data, compared to when applied to another
synthetic validation set. This also happens with other models,
such as an earlier method that uses training and testing sets
of similar properties to ours. This points to discrepancies between the
distributions of synthetic and real-world examples. We therefore
propose to decompose the N CNN layers into two sub-networks
to be learned sequentially:

• Unsupervised cross-domain sub-network Cu, which
consists of the first K layers of the CNN. It accounts for
extracting low-level visual features shared by both
synthetic and real-world data domains. Cu will be trained
in an unsupervised way, using unlabeled data from both
domains. It constitutes the crucial step that further
minimizes the low-level feature gap, beyond the
previous data augmentation efforts.

• Supervised domain-specific sub-network Cs, which
consists of the remaining N - K layers. It accounts for
learning higher-level discriminative features for
classification, based on the shared features from Cu. Cs
will be trained in a supervised way, using labeled data
from the synthetic domain only.

We show an example of the proposed CNN decomposition in
Fig. 5. The Cu and Cs parts are marked by red and green
colors, respectively, with N = 8 and K = 2. Note that the
low-level shared features are implied to be independent of
class labels. Therefore, in order to address the open-ended
problem of font classes, one may keep re-using the Cu
sub-network, and only re-train the Cs part.

Learning Cu from SCAE. Representative unsupervised
feature learning methods, such as the Auto-Encoder and the
Denoising Auto-Encoder, perform a greedy layer-wise
pre-training of weights using unlabeled data alone, followed by
supervised fine-tuning. However, they rely mostly on
fully-connected models and ignore the 2D image structure.

The Convolutional Auto-Encoder (CAE) was proposed
to learn non-trivial features using a hierarchical
unsupervised feature extractor that scales well to high-dimensional
inputs. The CAE architecture is intuitively similar to
the conventional auto-encoders, except that its
weights are shared among all locations in the input,
preserving spatial locality. CAEs can be stacked to form a deep
hierarchy called the Stacked Convolutional Auto-Encoder
(SCAE), where each layer receives its input from a latent
representation of the layer below. Fig. 6 plots the SCAE
architecture for our K = 2 case.

^Note that squeezing is independent of the variable aspect
ratio operation introduced in Section 2.3, as they serve
different purposes.

Figure 5: The CNN architecture in the DeepFont system, and its decomposition marked by different colors
(N=8, K=2).

Figure 6: The Stacked Convolutional Auto-Encoder
(SCAE) architecture.

Training Details. We first train the SCAE on both
synthetic and real-world data in an unsupervised way, with a
learning rate of 0.01 (we do not anneal it through training).
Mean Squared Error (MSE) is used as the loss function.
After the SCAE is learned, its Conv. Layers 1 and 2 are imported
into the CNN in Fig. 5 as the Cu sub-network and fixed. The
Cs sub-network, based on the output of Cu, is then trained
in a supervised manner. We start with a learning rate of
0.01, and follow a common heuristic of manually dividing the
learning rate by 10 when the validation error rate stops
decreasing with the current rate. The “dropout” technique is
applied to the fc6 and fc7 layers during training. Both Cu and
Cs are trained with a default batch size of 128, momentum
of 0.9 and weight decay of 0.0005. The network training is
implemented using the CUDA ConvNet package, and
runs on a workstation with 12 Intel Xeon 2.67GHz CPUs
and 1 GTX680 GPU. It takes around 1 day to complete the
entire training pipeline.

Testing Details. We adopt multi-scale multi-view testing
to improve the robustness of the results. Each test image
is first normalized to 105 pixels in height, but squeezed in
width by three different random ratios, all drawn from a
uniform distribution between 1.5 and 3.5, matching the
effects of the squeezing and variable aspect ratio operations during
training. Under each squeezed scale, five 105 x 105 patches
are sampled at different random locations. That constitutes
in total fifteen test patches from one test image, each with a
different aspect ratio and view. As
every single patch produces a softmax vector through
the trained CNN, we average all fifteen softmax vectors to
determine the final classification result of the test image.
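
The testing procedure above can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the squeezing ratio is read as width divided by height, a nearest-neighbour resize stands in for a proper image resizer, and `predict_proba` is a hypothetical callable that maps one 105 x 105 patch to a softmax vector.

```python
import numpy as np

PATCH = 105  # patch side and normalized image height, as in the text

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D gray-scale array (stand-in resizer)."""
    h, w = img.shape
    rows = np.minimum((np.arange(out_h) * h / out_h).astype(int), h - 1)
    cols = np.minimum((np.arange(out_w) * w / out_w).astype(int), w - 1)
    return img[rows][:, cols]

def multi_view_predict(img, predict_proba, rng, n_ratios=3, n_patches=5):
    """Average softmax vectors over n_ratios squeezed scales, each with
    n_patches random 105 x 105 crops (3 x 5 = 15 views by default)."""
    probs = []
    for ratio in rng.uniform(1.5, 3.5, size=n_ratios):
        width = int(round(ratio * PATCH))          # always wider than PATCH
        scaled = resize_nearest(img, PATCH, width)
        for _ in range(n_patches):
            x0 = rng.integers(0, width - PATCH + 1)  # random horizontal offset
            probs.append(predict_proba(scaled[:, x0:x0 + PATCH]))
    return np.mean(probs, axis=0)
```

The predicted class is then the arg-max of the averaged vector; averaging probabilities rather than voting keeps the ranking needed for top-5 evaluation.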


3.2 Connections to Previous Work 

We are not the first to look into an essentially
“hierarchical” deep architecture for domain adaptation. In earlier
work, the proposed transfer learning approach relies on the
unsupervised learning of representations. Bengio et al. hypothesized
that more levels of representation can give rise to more
abstract, more general features of the raw input, and that
the lower layers of the predictor constitute a hierarchy of
features that can be shared across variants of the input
distribution. One such approach used data from the union
of all domains to learn their shared features, which is
different from many previous domain adaptation methods that
focus on learning features in an unsupervised way from the
target domain only. However, its entire network hierarchy
is learned in an unsupervised fashion, except for a simple
linear classifier trained on top of the network, i.e., K = N - 1.
In another work, the CNN learned a set of filters from raw images
as the first layer, and those low-level filters are fixed when
training higher layers of the same CNN, i.e., K = 1. In
other words, they either adopt a simple feature extractor
(K = 1), or apply a shallow classifier (K = N - 1). Our
CNN decomposition is different from prior work in that:

• Our feature extractor Cu and classifier Cs are both
deep sub-networks with more than one layer (both K
and N - K are larger than 1), which means that both
are able to perform more sophisticated learning. More
evaluations can be found in Section 5.2.

• We learn “shared-feature” convolutional filters rather
than fully-connected networks as in earlier work; the former
are more suitable for visual feature extraction.

The domain mismatch between synthetic and real-world data
in the lower-level statistics can occur in more scenarios,
such as real-world face recognition from rendered images or
sketches, recognizing characters in real scenes with synthetic
training, and human pose estimation with synthetic images
generated from 3D human body models. We conjecture that
our framework is applicable to those scenarios as well,
where labeled real-world data is scarce but synthetic data
can be easily rendered.

4. LEARNING-BASED MODEL COMPRESSION

The architecture in Fig. 5 contains a huge number of
parameters. It is widely known that deep models are
heavily over-parameterized, and thus those parameters can be
compressed to reduce storage by exploring their structure.
For a typical CNN, about 90% of the storage is taken up
by the densely connected layers, which shall be our focus for
model compression.

One way to shrink the number of parameters is using
matrix factorization. Given the parameter $W \in \mathbb{R}^{m \times n}$, we
factorize it using singular-value decomposition (SVD):

$W = U S V^T$    (1)

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are two dense orthogonal
matrices and $S \in \mathbb{R}^{m \times n}$ is a diagonal matrix. To restore an
approximate $W$, we can utilize $\tilde{U}$, $\tilde{V}$ and $\tilde{S}$, which denote
the submatrices corresponding to the top $k$ singular vectors
in $U$ and $V$ along with the top $k$ singular values in $S$:

$\tilde{W} = \tilde{U} \tilde{S} \tilde{V}^T$    (2)

Storing $\tilde{U}$, $\tilde{V}$ and the diagonal of $\tilde{S}$ takes $k(m+n+1)$ values
instead of $mn$, so the compression ratio given $m$, $n$, and $k$ is
$\frac{k(m+n+1)}{mn}$, which
is very promising when $m, n \gg k$. However, the approximation
quality of SVD is controlled by the decay of the singular values
in $S$. Even though Fig. 7 verifies that the singular values of weight
matrices usually decay fast (the 6-th largest singular value is
already less than 10% of the largest one in magnitude), the
truncation inevitably leads to information loss, and potential
performance degradation, compared to the uncompressed
model.
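
The truncation in (2) and its storage cost can be illustrated with a small NumPy sketch. The matrix sizes and helper names below are hypothetical and chosen only for the example; the storage count assumes the two thin factors plus the k retained singular values are kept.

```python
import numpy as np

def truncated_svd(W, k):
    """Return the rank-k factors (U_k, s_k, Vt_k) of W from its SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def storage_ratio(m, n, k):
    """Compressed size over original size: k(m + n + 1) / (mn)."""
    return k * (m + n + 1) / (m * n)

def relative_error(W, k):
    """Relative Frobenius error of the rank-k reconstruction of W."""
    U_k, s_k, Vt_k = truncated_svd(W, k)
    W_tilde = (U_k * s_k) @ Vt_k        # same as U_k @ diag(s_k) @ Vt_k
    return np.linalg.norm(W - W_tilde) / np.linalg.norm(W)

rng = np.random.default_rng(0)
m, n, k = 64, 48, 5
# A matrix whose rank is exactly k is reproduced up to rounding error ...
W_low_rank = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
err_low_rank = relative_error(W_low_rank, k)
# ... while a generic dense matrix loses information under the same truncation.
W_dense = rng.standard_normal((m, n))
err_dense = relative_error(W_dense, k)
ratio = storage_ratio(m, n, k)
```

The contrast between the two matrices is the point of the example: truncation is harmless only when the matrix already has rank at most k.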



Figure 7: Plot of the singular values of the fc6 layer
weight matrix in Fig. 5. This densely connected
layer takes up 85% of the total model size.

Instead of first training a model and then lossy-compressing
its parameters, we propose to directly learn a losslessly
compressible model (the term “lossless” refers to the fact that
there is no further loss after a model is trained). Given the
parameter matrix W of a certain network layer, our goal is to
make sure that its rank is no more than a small
constant k. In terms of implementation, in each iteration,
an extra hard thresholding operation is executed on W
after it is updated by a conventional back propagation step:


$W_k = U T_k(S) V^T$    (3)

where $T_k$ keeps the largest $k$ singular values in $S$ while
setting the others to zero. $W_k$ is the best rank-$k$ approximation of
$W$, as in the matrix factorization approach above. However,
different from that approach, the
proposed method incorporates low-rank approximation into
model training and jointly optimizes them as a whole,
guaranteeing a rank-$k$ weight matrix that is ready to be
compressed losslessly by applying (1). Note that there are other
alternatives, such as vector quantization methods, that have
been applied to compressing deep models with appealing
performance. We will investigate utilizing them together
to further compress our model in the future.


5. EXPERIMENTS

5.1 Analysis of Domain Mismatch

We first analyze the domain mismatch between synthetic
and real-world data, and examine how our synthetic data
augmentation can help. First we define five dataset
variations generated from VFR_syn_train and VFR_real_u. These
are denoted by the letters N, S, F, R and FR and are
explained in Table 2.

We train five separate SCAEs, all of the same architecture
as in Fig. 6, using the above five training data variants. The
training and testing errors are all measured by relative MSEs
(normalized by the total energy) and compared in Table 2.
The testing errors are evaluated on both the unaugmented
synthetic dataset N and the real-world dataset R. Ideally,
the better the SCAE captures the features from a domain,
the smaller the reconstruction error will be on that domain.

As revealed by the training errors, real-world data
contains rich visual variations and is more difficult to fit. The
sharp performance drop from N to R of SCAE N indicates
that the convolutional features for synthetic and real data
are quite different. This gap is reduced in SCAE S, and
further in SCAE F, which validates the effectiveness of adding
font-specific data augmentation steps. SCAE R fits the
real-world data best, at the expense of a larger error on N. SCAE
FR achieves an overall best reconstruction performance of
both synthetic and real-world images.

Fig. 8 shows an example patch from a real-world font
image of highly textured characters, and its reconstruction
outputs from all five models. The gradual visual variations
across the results confirm the existence of a mismatch
between synthetic and real-world data, and verify the benefit
of data augmentation as well as learning shared features.


(a) original (b) SCAE N (c) SCAE S (d) SCAE F (e) SCAE R (f) SCAE FR

Figure 8: A real-world patch, and its reconstruction
results from the five SCAE models.


5.2 Analysis of Network Structure 

Fixing Network Depth N. Given a fixed network
complexity (N layers), one may ask how to best decompose
the hierarchy to maximize the overall classification
performance on real-world data. Intuitively, we should have
sufficient layers of lower-level feature extractors as well as
enough subsequent layers for good classification of labeled
data. Thus, the depth K of Cu should be neither too small
nor too large.

Table 3 shows that while the classification training error
increases with K, the testing error does not vary
monotonically. The best performance is obtained with K = 2 (K = 3 is
slightly worse), where smaller or larger values of K give
substantially worse performance. When K = 5, all layers are
learned using SCAE, leading to the worst results. Rather
than learning all hidden layers by unsupervised training, as
suggested in other DL-based transfer learning work,
our CNN decomposition reaches its optimal performance
when higher-layer convolutional filters are still trained by
supervised data. A visual inspection of the reconstruction
results of a real-world example in Fig. 9, using SCAE FR with
different K values, shows that a larger K causes less
information loss during feature extraction and leads to a better
reconstruction. But in the meantime, the classification result
may turn worse, since noise and irrelevant high-frequency
details (e.g. textures) might hamper recognition performance.
The optimal K = 2 corresponds to a proper “content-aware”
smoothing, filtering out “noisy” details while keeping
recognizable structural properties of the font style.


Table 2: Comparison of Training and Testing Errors (%) of Five SCAEs (K = 2)

Methods   Training Data                                  Train   Test N   Test R
SCAE N    N: VFR_syn_train, no data augmentation         0.02    3.54     31.28
SCAE S    S: VFR_syn_train, standard augmentation 1-4    0.21    2.24     19.34
SCAE F    F: VFR_syn_train, full augmentation 1-6        1.20    1.67     15.26
SCAE R    R: VFR_real_u, real unlabeled dataset          9.64    5.73     10.87
SCAE FR   FR: Combination of data from F and R           6.52    2.02     14.01


Table 3: Top-5 Testing Errors (%) for Different CNN Decompositions (Varying K, N = 8)

K               0       1       2       3       4       5
Train           8.46    9.88    11.23   12.54   15.21   17.88
VFR_real_test   20.72   20.31   18.21   18.96   22.52   25.97


Table 4: Top-5 Testing Errors (%) for Different CNN Decompositions (Varying K, N = K + 6)

K               1       2       3       4
Train           11.46   11.23   10.84   10.86
VFR_real_test   21.58   18.21   18.15   18.24


(a) (b) K=2 (c) K=4 (d) K=5

Figure 9: The reconstruction results of a real-world
patch using SCAE FR, with different K values.

Fixing Cs or Cu Depth. We investigate the influence of
K (the depth of Cu) when the depth of Cs (i.e., N - K)
is kept fixed. Table 4 reveals that a deeper Cu contributes
little to the results. Similar trends are observed when we fix
K and adjust N (and thus the depth of Cs). Therefore, we
choose N = 8, K = 2 as the default setting.

5.3 Recognition Performances on VFR Datasets 

We implemented and evaluated the local feature
embedding-based algorithm (LFE) as a baseline, and include the
four different DeepFont models as specified in Table 5. The
first two models are trained in a fully supervised manner on
F, without any decomposition applied. For each of the latter
two models, its corresponding SCAE (SCAE FR for DeepFont
CAE_FR, and SCAE R for DeepFont CAE_R) is first trained and
then exports the first two convolutional layers to Cu. All
trained models are evaluated in terms of top-1 and top-5
classification errors on the VFR_syn_val dataset for
validation purposes. Benefiting from their large learning
capacity, the DeepFont models clearly fit synthetic data
significantly better than LFE. Notably, the top-5 errors of
all DeepFont models (except for DeepFont CAE_R) reach zero on
the validation set, which is quite impressive for such a
fine-grained classification task.
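
For readers unfamiliar with the metric, the following is a minimal
sketch of how a top-k classification error can be computed from
per-class scores. It is not the evaluation code used for the
experiments; the function name `top_k_error` and the toy scores are
assumptions made for illustration.

```python
import numpy as np


def top_k_error(scores, labels, k):
    """Percentage of samples whose true label is not among the k highest scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * (1.0 - hit.mean())


# Toy example: 3 samples, 4 classes (made-up scores without ties).
scores = [[0.10, 0.60, 0.20, 0.10],
          [0.50, 0.10, 0.30, 0.10],
          [0.30, 0.15, 0.05, 0.50]]
labels = [1, 2, 1]

print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

In the toy data the first sample is correct at top-1, the second
only at top-2, and the third at neither, so the top-2 error is lower
than the top-1 error.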

We then compare DeepFont models with LFE on the original
VFRWild325 dataset. As seen from Table 5, while DeepFont S
fits synthetic training data best, its performance is the
poorest on real-world data, showing severe overfitting. With
two font-specific data augmentations added in training, the
DeepFont F model adapts better to real-world data,
outperforming LFE by roughly 8% in top-5 error. An additional
gain of 2% is obtained when unlabeled real-world data is
utilized in DeepFont CAE_FR. Next, the DeepFont models are
evaluated on the new VFR_real_test dataset, which is more
extensive in size and class coverage. A large margin of
around 5% in top-1 error is gained by the DeepFont CAE_FR
model over the second best (DeepFont F), with its top-5 error
as low as 18.21%. We will use DeepFont CAE_FR as the default
DeepFont model.

Although SCAE R has the best reconstruction result on 
real-world data on which it is trained, it has large training 
and testing errors on synthetic data. Since our supervised 
training relies fully on synthetic data, an effective feature 
extraction for synthetic data is also indispensable. The er¬ 
ror rates of DeepFont CAE_R are also worse than those of 
DeepFont CAE_FR and even DeepFont F on the real-world 
data, due to the large mismatch between the low-level and 
high-level layers in the CNN. 



Figure 10: Failure VFR examples using DeepFont.

Another interesting observation is that all methods get
similar top-5 errors on VFRWild325 and VFR_real_test, showing
their statistical similarity. However, the top-1 errors of
DeepFont models on VFRWild325 are significantly higher
than those on VFR_real_test, with a difference of up to 10%.
In contrast, the top-1 error of LFE is more than 13% higher
on VFR_real_test than on VFRWild325. For the small
VFRWild325, the recognition result is easily affected by “bad”
examples (e.g., low-resolution or highly compressed images)
and class bias (less than 4% of all classes are covered). On
the other hand, the larger VFR_real_test dataset dilutes the
possible effect of outliers, and examines many more classes.

Figure 11: Examples of the font similarity. For each
one, the top is the query image, and the renderings
with the most similar fonts are returned.

Fig. 10 lists some failure cases of DeepFont. For example,
the top left image contains extra “fluff” decorations along
text boundaries, which do not exist in the original fonts
and cause the algorithm to incorrectly map it to some
“artistic” fonts. Others are affected by 3-D effects and by
strong obstacles in the foreground and in the background.
Being considerably difficult to adapt to, those examples fail
mostly because there are neither specific augmentation steps
handling their effects, nor enough examples in VFR_real_u to
extract corresponding robust features.

5.4 Evaluating Font Similarity using DeepFont

There are a variety of font selection tasks with different
goals and requirements. One designer may wish to match a
font to the style of a particular image. Another may wish
to find a free font which looks similar to a commercial font
such as Helvetica. A third may simply be exploring a large
set of fonts such as Adobe TypeKit or Google Web Fonts.
Exhaustively exploring the entire space of fonts using an
alphabetical listing is unrealistic for most users. Prior
work proposed to select fonts based on online crowdsourced
attributes and explored font similarity, which enables a user
to explore other visually similar fonts given a specific
font. The font similarity measure is very helpful for
font selection, organization, browsing, and suggestion.

Based on our DeepFont system, we are able to build up
measures of font similarity. We use the 4096 × 1 outputs of
the fc7 layer as the high-level feature vectors describing
the visual appearance of fonts. We then extract such features
from all samples in the VFR_syn_val dataset, obtaining 100
feature vectors per class. Next, for each class, the 100
feature vectors are averaged into a representative vector.
Finally, we calculate the Euclidean distance between the
representative vectors of two font classes as their
similarity measure. Visualized examples are shown in Fig. 11.
For each example, the top is the query image of a known font
class; the most similar fonts obtained by the font similarity
measure are sorted below. Note that although the result fonts
can belong to different font families from the query, they
share visual similarities that are identifiable by human
perception.
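
The similarity computation described above can be sketched as
follows. This is a minimal illustration, not the DeepFont
implementation: random 8-dimensional vectors stand in for the
4096-dimensional fc7 outputs, and the helper names
`class_representatives` and `most_similar` are assumptions.

```python
import numpy as np


def class_representatives(features, labels):
    """Average the feature vectors of each class into one representative vector."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    reps = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, reps


def most_similar(query_class, classes, reps, top=3):
    """Rank the other classes by Euclidean distance to the query representative."""
    q = reps[np.where(classes == query_class)[0][0]]
    dist = np.linalg.norm(reps - q, axis=1)
    order = np.argsort(dist)
    ranked = [(classes[i].item(), float(dist[i]))
              for i in order if classes[i] != query_class]
    return ranked[:top]


# Toy data: 3 classes, 100 random 8-dimensional features each
# (stand-ins for the fc7 outputs; the class centers are made up).
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [1.0] * 8, [5.0] * 8])
labels = np.repeat(np.arange(3), 100)
features = centers[labels] + 0.1 * rng.standard_normal((300, 8))

classes, reps = class_representatives(features, labels)
print(most_similar(0, classes, reps, top=2))
```

A smaller distance means a more similar font class, so the returned
list is sorted from most to least similar.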

Although not numerically verified as in the
crowdsourced-attribute work, the DeepFont results are
qualitatively better when we look at the top-10 most similar
fonts for a wide range of query fonts. The authors of that
work agree, per personal communication with us.

5.5 DeepFont Model Compression 

Since the fc6 layer takes up 85% of the total model size, we 
first focus on its compression. We start from a well-trained 
DeepFont model (DeepFont CAE_FR), and continue tuning 
it with the hard thresholding applied to the fc6 parame¬ 
ter matrix W in each iteration, until the training/validation 
errors reach the plateau again. 
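
A rough sketch of this tuning loop is given below. It assumes that
hard thresholding means keeping only the k largest singular values
of W after each update; the precise definition is not restated in
this section, so this reading is an assumption. The random update
is a placeholder for a real gradient step, and the small matrix is
a stand-in for the fc6 parameter matrix.

```python
import numpy as np


def hard_threshold_rank(w, k):
    """Zero all but the k largest singular values of w (rank at most k)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    s[k:] = 0.0
    return (u * s) @ vt


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))  # small stand-in for the fc6 matrix
for step in range(3):
    # Placeholder for a gradient update computed from training data.
    w -= 0.01 * rng.standard_normal(w.shape)
    # Re-impose the low-rank constraint after every update.
    w = hard_threshold_rank(w, k=5)

print(np.linalg.matrix_rank(w))
```

Because the constraint is re-applied after every update, the matrix
stays low-rank throughout tuning rather than being factorized once
after training.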

Table 6 compares the DeepFont models compressed using
conventional matrix factorization (denoted as the “lossy”
method) and the proposed learning-based method (denoted
as the “lossless” method), under different compression ratios
(fc6 and total size counted by parameter numbers). The
last column of Table 6 lists the top-5 testing errors (%) on
VFR_real_test. We observe a consistent margin of the
“lossless” method over its “lossy” counterpart, which becomes
more significant as k decreases (more than 1% when k = 5).
Notably, when k = 100, the proposed “lossless” compression
suffers no visible performance loss, while still maintaining
a good compression ratio of 5.79.
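
The compression ratios in Table 6 can be checked from the parameter
counts alone. The short script below recomputes them from the
figures in the table. The remark that each compressed fc6 size
equals k × 40,961 is an observation about those numbers, not
something stated in the text.

```python
# Figures copied from Table 6 (parameter counts).
FC6_DEFAULT = 150_994_944
TOTAL_DEFAULT = 177_546_176
OTHER_LAYERS = TOTAL_DEFAULT - FC6_DEFAULT  # parameters outside fc6

FC6_COMPRESSED = {5: 204_805, 10: 409_610, 50: 2_048_050, 100: 4_096_100}


def compression_ratio(fc6_size):
    """Default total size divided by the total size with a compressed fc6."""
    return TOTAL_DEFAULT / (OTHER_LAYERS + fc6_size)


for k, fc6_size in FC6_COMPRESSED.items():
    # Observation (not stated in the text): each listed size equals k * 40,961.
    assert fc6_size == k * 40_961
    total = OTHER_LAYERS + fc6_size
    print(k, total, round(compression_ratio(fc6_size), 2))
```

Since the layers outside fc6 are left untouched, the achievable
ratio is bounded by their size, which is why the ratio changes only
slowly between k = 5 and k = 100.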

In practice, it takes around 700 megabytes to store all the
parameters in our uncompressed DeepFont model, which is
too large to be embedded in or downloaded into most customer
software. More aggressively, we reduce the output sizes of
both fc6 and fc7 to 2048, and further apply the proposed
compression method (k = 10) to the fc6 parameter matrix. The
obtained “mini” model, with only 9,477,066 parameters and a
high compression ratio of 18.73, takes less than 40 megabytes
in storage. Portable even on mobile devices, it manages to
keep a top-5 error rate of around 22%.

Table 5: Comparison of Training and Testing Errors on Synthetic and Real-world Datasets (%)

                  Training Data   Training  VFR_syn_val    VFRWild325     VFR_real_test
Methods           Cu      Cs      Error     Top-1  Top-5   Top-1  Top-5   Top-1  Top-5
LFE               /       /       /         26.50  6.55    44.13  30.25   57.44  32.69
DeepFont S        /       F       0.84      1.03   0       64.60  57.23   57.51  50.76
DeepFont F        /       F       8.46      7.40   0       43.10  22.47   33.30  20.72
DeepFont CAE_FR   FR      F       11.23     6.58   0       38.15  20.62   28.58  18.21
DeepFont CAE_R    R       F       13.67     8.21   1.26    44.62  29.23   39.46  27.33

Table 6: Performance Comparisons of Lossy and Lossless Compression Approaches

          fc6 size      Total size    Ratio  Method    Error
default   150,994,944   177,546,176   NA     NA        18.21
k=5       204,805       26,756,037    6.64   Lossy     20.67
                                             Lossless  19.23
k=10      409,610       26,960,842    6.59   Lossy     19.25
                                             Lossless  18.87
k=50      2,048,050     28,599,282    6.21   Lossy     19.04
                                             Lossless  18.67
k=100     4,096,100     30,647,332    5.79   Lossy     18.68
                                             Lossless  18.21
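
The storage figures quoted in Section 5.5 are consistent with the
parameter counts if each parameter is stored in 4 bytes. The text
does not state the storage format, so the 4-byte (32-bit float)
figure in the sketch below is an assumption.

```python
BYTES_PER_PARAM = 4  # assumption: 32-bit floating-point parameters


def size_megabytes(num_params):
    """Storage in megabytes (10**6 bytes) for the given number of parameters."""
    return num_params * BYTES_PER_PARAM / 1e6


full = 177_546_176  # uncompressed DeepFont model (parameter count)
mini = 9_477_066    # "mini" model (parameter count)

print(round(size_megabytes(full)))     # size of the uncompressed model in MB
print(round(size_megabytes(mini), 1))  # size of the "mini" model in MB
print(round(full / mini, 2))           # compression ratio of the "mini" model
```

Under this assumption the uncompressed model comes to roughly 700
megabytes and the “mini” model stays below 40 megabytes, matching
the quoted figures.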

6. CONCLUSION 

In this paper, we develop the DeepFont system to remarkably
advance the state of the art in the VFR task. A large
set of labeled real-world data as well as a large corpus of
unlabeled real-world images is collected for both training and
testing, which is the first of its kind and will be made
publicly available soon. While relying on the learning
capacity of CNN, we need to combat the mismatch between
available training and testing data. The introduction of
SCAE-based domain adaptation helps our trained model achieve a
top-5 accuracy higher than 80%. A novel lossless model
compression is further applied to improve the model storage
efficiency. The DeepFont system is not only effective for
font recognition, but can also produce a font similarity
measure for font selection and suggestion.